Arabic text categorization: a comparative study of different representation modes
Abstract
The quantity of information accessible on the Internet is phenomenal, and its categorization remains one of the most important problems. Much current work focuses on English, and rightly so, since it is the dominant language of the Web. However, the need extends to other languages, because the Web becomes more multilingual every day, and it is especially pressing for Arabic. Our research addresses the categorization of Arabic texts; its originality lies in the use of a conceptual representation of the text, for which we use Arabic WordNet (AWN) as a lexical and semantic resource. To assess its effect, we include it in a comparative study with the other usual modes of representation (bag of words and N-grams), using different similarity measures. The results show the advantages of this representation over the more conventional methods and indicate that adding the semantic dimension is one of the most promising approaches for the automatic categorization of Arabic texts.
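As a rough illustration of the three representation modes named in the abstract, the following minimal sketch builds bag-of-words, character N-gram, and concept-based vectors and compares two texts with cosine similarity. It is not the authors' implementation: the `TOY_CONCEPTS` dictionary is a hypothetical stand-in for Arabic WordNet, the function names are invented for this example, English tokens are used for readability, and cosine is only one of the similarity measures such a study might use. Real Arabic input would also need proper tokenization and stemming.

```python
from collections import Counter
from math import sqrt

# Hypothetical word-to-concept map standing in for Arabic WordNet (AWN).
# The entries are illustrative only and are not taken from AWN.
TOY_CONCEPTS = {
    "car": "vehicle",
    "truck": "vehicle",
    "doctor": "medical_professional",
    "physician": "medical_professional",
}


def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def concepts(text, mapping=TOY_CONCEPTS):
    """Replace each token by its concept when the map has one, then count."""
    return Counter(mapping.get(tok, tok) for tok in text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    d1 = "the doctor drives a car"
    d2 = "the physician drives a truck"
    for name, rep in [("bag of words", bag_of_words),
                      ("char 3-grams", char_ngrams),
                      ("concepts", concepts)]:
        print(name, round(cosine(rep(d1), rep(d2)), 3))
```

On this toy pair the concept representation scores higher than bag of words, because synonyms that share no surface form are mapped to the same concept. This is the kind of effect a semantic resource is meant to capture.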
Similar resources
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
A Comparative Study with Different Feature Selection For Arabic Text Categorization
Feature selection benefits a learner by eliminating non-informative or noisy features and by reducing the overall feature space to a manageable size. In machine learning, the term feature selection refers to the process of selecting a subset of the features used to represent the text. In this paper, we propose a new approach for text representation based on incorporating background Knowledge Arabi...
A comparative study of the text inside the Mihrabi rug by Zareh Penyamin and Topkapi Palace Museum according to the existing discourse in the 16th and 19th
At the end of the 19th century, rugs known as Mihrabi became popular in the city of Hereke, Turkey; they were inspired by the rugs of the Safavid era kept in the Topkapi Palace Museum. In these rugs, which were reproduced in royal workshops on a large scale, some changes were made to the verbal text and the incorporated visual elements. Among the rugs that seem to have ...
A Comparative Study in Relation to the Translation of the Linguistic Humor
Mark Twain made use of repetition and parallelism as two comedic literary devices to create a comic effect for readers. Linguistic devices of humor such as repetition and parallelism seem to create many difficulties in the translation of literary texts. The present study applied Delabastita's strategies for translating wordplays such as repetition and parallelism in the translation of humorous texts...
New stemming for arabic text classification using feature selection and decision trees
In this paper we conduct a comparative study of two stemming algorithms for Arabic text classification (categorization): the Khoja stemmer and our new stemmer, using chi-square statistics for feature selection and focusing on a decision tree classifier. The evaluation used a corpus of 5070 documents independently classified into six categories: sport, entertainment, business, middle east...
Journal title:
- Int. Arab J. Inf. Technol.
Volume: 9
Publication date: 2012